Compiler IPC Format

The compiler IPC format is the format of the data passed between the parser, the optimizer, and whichever backend is in use. It is a streamed format. The streaming concept is structured in a way similar to BACnet protocol. Which is to say structured or list-like information is bracketed by a 'begin' and 'end' of block tokens. There are different tokens for different kinds of data, and the tokens may be nested, but the end always matches the begin. Usually when a list is streamed, the count of items in the list is streamed first.

There are only a couple of kinds of data, index-like integers, pure integers, and floats being the main ones. Strings are streamed as a counted list of bytes.

Because the data is being streamed, it is possible to write an unstreamer that has the same overall structure as the streamer. Except that it restores internal data structures instead of writing them.

This documentation is mostly not going to be concerned with the exact numeric values of enumerations in the streamed data, preferring to refer to them by the name given in the source code.

At the lowest level, various structures such as the type structure and symbol structure are being streamed. In this document, we will be considering things in a general sense and won't delve into exact details of what is being streamed for each low-level structure.

1.0.0 Multiple input files

The streamed data is placed in shared memory. Since size is always implicitly known, it is possible to stream data for multiple input files successively into the same shared memory, concatening one immediately after the other. The handling for multiple input files is that simple, although in practice this is augmented by a list of input files that is found inside the streamed data. The remainder of this document will focus on the data found in just one of stream datas.

1.1.0 Overall Format

At the highest level, the following data is streamed:

1.2.0 Primitive data types

1.2.1 'index' data type

The first primitive data type is an 'index' data type. Index data types with a value between 0 and 0x7fff are rendered as two bytes (big endian); larger values are rendered as four bytes (big endian) but the high bit of the first byte is set. The fact that the high bit of the first byte is set makes this unsuitable for storing negative numbers.

1.2.2 Block tokens

It is always known whether to expect a block token, based on context. Block tokens are one byte wide, but the upper two bits are information as to whether it is the beginning or ending of a block so that means there are effectively a maximum of 64 block types. The block types are currently separated into two categories, which are determined based on context.

Block tokens have bit 6 set if it is the begin token, or bit 7 set if it is the end token.

An end token has to be given for each begin token.

1.2.3 'Int' data type

An integer is a 4-byte signed value given in big-endian order, bracketed by the STT_INT block token type.

1.2.4 'Float' data type

A floating point value is a stream of the internal floating point data structure, bracketed by the STT_FLOAT block token type.

There are 3 data structures streamed as indexes. These are:

the floating point type (e.g. zero, infinity, nan) the sign the exponent

Additionally, the mantissa is streamed as a sequence of indexes.

1.2.4 'String' data type

the length of the string is streamed as an index, followed by the string contents as a stream of bytes. It is intended that strings be represented in UTF-8.

1.2.5 'Lists'

When streaming a list, the first item streamed is an index of the count of items in the list.

The type of data in a list is always known by context, but for example when streaming structured data it will usually also be spelled out with block tokens around each list element.

After streaming the count of items, each list element is streamed in the order it exists in the list.

For example, a list of strings would start with a count of the number of strings, followed by count number of strings. Each string would also have a count, then data bytes for the string.

1.2.6 text

When text is encountered, for example for the name of a symbol or for line information, the text is placed into a separete 'text' section. When streaming such text, the offset into the text section is streamed in place of the actual text. The text section is streamed into the file last.

1.2.6 symbol indices

When a symbol reference is encountered other than in its definition, an index is assigned based on which table it is in. Based on symbol type or other information the table can be inferred by the unstreamer and the index looked up in the appropriate table. Note that tables for things like local variables are specific to the function definition being parsed.

The methods of assigning and resolving such indexes are beyond the scope of this document.

2.0.0 File header

The first thing in the file is the four byte string "LSIL". This is always checked to make sure there isn't a problem with any previously unstreamed data. Next comes a IPC version, streamed as an integer. In the odd case that a breaking change is made to the IPC format and mismatched compiler executables are used this should catch it.

Next comes the architecture being compiled for. There are currently two architectures:

ARCHITECTURE_MSIL ARCHITECTURE_X86

The architecture isn't really needed by the streaming, however, may be used for example in decision making in the optimizer.

3.0.0 Primary parameters

The primary parameters consists of a block of mostly boolean values that tell things like whether to generate assembly code, whether to generate an object file or call a linker, what levels of optimization should occur, whether to include browse or debug information, and so forth. These are conveniently collected into a structure called COMPILER_PARAMS in the front end. Streaming this simply streams the contents of the structure as-is, wrapped in SBT_PARAMS block tokens.

4.0.0 Secondary parameters

The secondary parameters are other administrative parameters collected in the front end, often by interpreting the command line. The entire list of parameters is bracketed by SBT_XPARAMS block tokens.

The following things are streamed here:

5.0.0 Globals table

This is a symbol list bracketed in the SBT_GLOBALSYMS block tokens.

Each symbol in the list consists of symbol data, bracketed by STT_SYMBOL block tokens.

The first data streamed in the symbol list is the storage class; if this is SCC_NONE no symbol is present at this point in the symbol list and none of the other symbol-related information is added.

Otherwise, relevant information such as symbol name, storage class, type, and other things is streamed out.

6.0.0 Externals table

This is a symbol list bracketed in the SBT_EXTERNALS block tokens.

Each symbol in the list consists of symbol data, bracketed by STT_SYMBOL block tokens.

The first data streamed in the symbol list is the storage class; if this is SCC_NONE no symbol is present at this point in the symbol list and none of the other symbol-related information is added.

Otherwise, relevant information such as symbol name, storage class, type, and other things is streamed out.

7.0.0 Types table

Structured types are bracketed by SBT_TYPES block tokens.

Each symbol in the list consists of symbol data, bracketed by STT_SYMBOL block tokens.

The first data streamed in the symbol list is the storage class; if this is SCC_NONE no symbol is present at this point in the symbol list and none of the other symbol-related information is added.

Otherwise, relevant information such as symbol name, storage class, type, and other things is streamed out.

This type of symbol in this table is sometimes a structure definition, which means the list of members also gets streamed out as a list of symbol indexes. Such ancilliary symbols are also defined in this table.

Here the primary information used by the symbol is its type.

7.1.0 Streaming a type

a type is streamed, bracketed by STT_TYPE block tokens. If the type is ST_NONE none of the other data is present. Otherwise information such as the basic type (int, short, char, etc) and various sizing information used by the optimizer and backend is present. If the type is a pointer it has a base type which also gets streamed, and so on until we hit a basic or structured type. This actually results in nested STT_TYPE tokens.

8.0.0 Browse information

Browse information consists of two lists. It may be present either if browse information or debug information is requested.

The first list, bracketed by SBT_BROWSEFILES block tokens, gives a list of files that were included by the current source file.

The second list is bracketed by SBT_BROWSEINFO block tokens, and holds specific information about location and types of symbols in the source file.

Each item in the SBT_BROWSEFILES list is bracketed by STT_BROWSEFILE block tokens. The information streamed here consists of the file name, and the file index used by the SBT_BROWSEINFO structures.

Each item in the SBT_BROWSINFO list is bracketed by STT_BROWSEINFO block tokens.

9.0.0 MSIL Properties

The compiler allows definition of properties for MSIL. These are streamed as a list of items in a block bracketed by SBT_MSILPROPS tokens. The list element for each item includes things like the getter/setter and propery definition symbol.

10.0.0 Streaming data

Inside a set of SBT_DATA block tokens is the data and code generated by the parser. The data consists of various things such as integer or floating point values, references to symbols, and so forth. It is enumerated at the lowest level; e.g. for a structure defined with a value there would be several data references, usually one per member unless the member is itself a structure or array and then those elements get enumerated.

For example the DT_UINT stream type is followed by a 4-byte integer value, while the DT_DEFINITION would be followed by a symbol index; it is defining a symbolic label in the text. DT_SYM would be followed by a symbol reference and would define data which is the reference to a symbol. And so forth. The backend can parse this data directly into an ASM file or into segments...

This document isn't going to detail every streamed type but it will go into the DT_FUNC type in some detail.

10.1.0 Streaming functions

Streaming a function consists of streaming various administrative details about the function. This information is kept and streamed on a per-function basis.

Some of the salient features we want to consider are the list of variables defined by the function; the list of temporaries which may have to be placed on the stack; the list of instructions and their operands; a list of flags for temporaries; and the list of load temporaries.

The lists of variables and temporaries are symbol tables similar to the global and external symbol tables; they are lists of symbols streamed in exactly the same manner as before. Note that the unstreamer can tell by context whether to use these tables rather than the global/external tables when interpreting symbol indexes.

Variables and temporaries can sometimes be referenced after the function body is processed, e.g. by exception table processing for the function. The exception information needs location about where on the stack to find variables to destroy.

Operands are oft-repeated and need to be kept the same for each use, so they are collected and streamed independently in a list. See section 10.2.0 for details.

Instructions are also streamed in a list. Note there are 'passthrough' instructions which amount to being target assembly language instructions passed through from the parser to the backend. This is somewhat tailored to the x86 at present but in future could be generalized to other architectures.

See section 10.3.0 for details on instructions

Section 10.4.0 details how a sparse list of temporary flags is stored.

Section 10.5.0 details how a sparse list of load temporaries is stored.

10.2.0 Operands

Intermediate language instructions are usually three-operand expressions, usually with one operand being assigned to and the other two operands being the inputs. For example an addition instruction has a target and two inputs that are added together.

This isn't always the case, with some instructions such as a parameter specification or gosub instruction only having one operand which isn't the target, and other instructions having a label instead of the target.
However, in general instructions are streamed as if they are an opcode with three operands.

Operands must be handled specially for the optimizer, for example an operand which refers to a variable 'i' directly must be exactly the same object as any other operand referring to that same variable (although if there are multiple variables 'i' in the same procedure each is a unique object).

Therefore operands are gathered from the instruction stream first and then each unique instance is streamed out. The instance index is used later when streaming instructions. This is called streaming the IMODE list in the code.

Each operand is bracketed by STT_OPERAND block tokens. The data for the operand includes an opcode, various administrative information, and then three optional expressions which correspond to information about the opcode. Usually only one expression is used, however, the INDIRECT operand mode allows expression of semi-complex indirect nodes such as the based addressing modes on the x86. This would include a base register, an index register, and a numeric/labeled offset in the most expressive case. Initially in the parser, the compiler only generates a single expression for each operand, but a later target-specific optimization can generate more complex sets of expressions.

10.2.1 Expressions

Expressions are used in the operands for the intermediate language instructions, and in the operands for streamed assembly language instructions. An expression consists of the expression type, and two optional sub-expressions. An expression with all its subexpressions and all their subexpressions combined is an expression tree.

For example an addition expression adds the 'left' expression to the 'right' expression and returns a result. an integer expression holds an integer value. A symbolic expression holds the address of some symbol, which can be either user-defined or compiler-defined. Symbolic expressions are further broken down by type, for example there is one for global variables, one for auto variables, one for variables that can be accessed by the program counter (e.g. subroutines) and one for temporary variables.

Each expression is streamed inline at the point at which it is referenced; there isn't a global table of expressions like used for other things. Each expression is bracketed by the STT_EXPRESSION block tokens. It consists of an expression type (mode), which may be se_none if no other data for the expression exists. This is followed by any value for the expression, e.g. an integer or floating point number, or a symbol reference. It is further followed by streaming the expressions of the 'left' 'right' and 'altdata' subexpressions in case the instruction is an add or subtract . The 'altdata' subexpression is used by the MSIL compiler.

10.2.2 Assembly Language Operands

An assembly language operand is similar in concept to an intermediate language operand, although it only has one expression. Base and index information for based addressing modes is kept in specific fields instead of being broken out into multiple expression nodes.

The assembly language operand isn't bracketed by anything. It consists of an operand type (mode), various administrative information, and an expression which is used to describe constants and memory locations. The expression tree used here is the same one as the expression tree used in the intermediate language operands and is streamed in the same manner.

10.2.3 Assembly Language Instructions

An assembly language instruction is specified by the user, and constists of an opcode, various administrative information, and up to three assembly language operands. Streaming it isn't bracketed. The actual values in this structure are dependent on the targeted instruction set, however, only x86 is currently supported.

10.3.0 Intermediate language Instructions

An intermediate language opcode is the output of the compiler. It consists of the opcode, various administrative information and up to three operands. Any of the operands can be references to internal 'temporary' variables. The temporary variables are compiler-generated placeholders for intermediate results not directly specified by the program. By the end of the optimization process, some will be removed, others will be in registers, and still others will be placed on the stack.

Other operands might be constants, variable addresesses, or variable values.

The intermediate language has instructions for things like arithmetic functions, conditional branches, gosubs and accompanying parameter pushes/adjustments, and block copy/clear instructions used in structure operations. It also includes a couple of advanced concepts used to ease the generation of C++ programs.

Intermediate language instructions are bracketed by the STT_INSTRUCTION block tokens.

The intermediate language is specified in a separate document.

10.4.0 Temporary flags

Some boolean information about temporaries has to be propagated from the front end to the optimizer, to allow various optimizations. This information is expected to be sparse compared to the number of temporaries typically in use. A list of temporaries is streamed, each item being accompanied with the boolean information.

10.5.0 Load Temporaries

A list of which temporaries essentially load from user-defined variables is valuable for some aspects of optimizations. This list is expected to be sparse compared to the number of temporaries. Each list element has two components: one is the number of the temporary, and the other is the imode index for the imode that the temporary loads from.